Datascience para el Bien Social

Learning to win the quiz show “Jeopardy!” with Python and data analysis (part 1/2)

Jeopardy! is an American television quiz show created by Merv Griffin. It tests general knowledge with clues on many topics, such as history, languages, literature, popular culture, fine arts, science, geography, and sports. One of the three contestants picks a panel on the game board; when uncovered, it reveals a clue phrased as an answer, and the contestants must then give their responses in the form of a question.

  • In this project we will see how to:
    • Normalize the text.
    • Look for answers inside the questions themselves.
    • Look for questions that tend to repeat (trying to win the easy way :)
    • Compare high-value and low-value questions, to maximize the points we earn.
    • Apply the chi-squared test.
In [2]:
import pandas as pd

jeopardy = pd.read_csv("jeopardy.csv")

print(jeopardy.head())
   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  
In [4]:
jeopardy.columns
Out[4]:
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
In [5]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.columns
Out[5]:
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
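
The column names above were retyped by hand to drop the leading spaces. As a minimal sketch of an alternative, the same cleanup can be done generically with `str.strip()` on the column index. The small `toy` frame below is made up for illustration and only mimics the leading-space headers of jeopardy.csv.

```python
import pandas as pd

# Toy frame (hypothetical data) with the same kind of leading-space headers.
toy = pd.DataFrame(
    [[4680, "2004-12-31", "Jeopardy!"]],
    columns=["Show Number", " Air Date", " Round"],
)

# Strip surrounding whitespace from every column name in one step.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

On the real data this would be `jeopardy.columns = jeopardy.columns.str.strip()`, which avoids typos when retyping the names.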

Normalizing the text.

In [6]:
import re

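def normalize_text(text):
    # Lowercase, then drop everything except letters, digits and whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Strip punctuation such as "$" and "," before converting to an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)

    try:
        text = int(text)
    except ValueError:
        # Values that are not a number count as 0.
        text = 0
    return text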
        
In [7]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
In [8]:
print(jeopardy.head())
   Show Number    Air Date      Round                         Category Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  clean_value  
0  for the last 8 years of his life galileo was u...   copernicus          200  
1  no 2 1912 olympian football star at carlisle i...   jim thorpe          200  
2  the city of yuma in this state has a record av...      arizona          200  
3  in 1963 live on the art linkletter show this c...    mcdonalds          200  
4  signer of the dec of indep framer of the const...   john adams          200  
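
To make the normalization step easier to check, here is a small sketch that applies the two helpers to single strings instead of whole columns. The functions are restated so the block runs on its own; the sample strings `"$2,000"` and `"None"` are assumptions chosen for illustration, not values quoted from the dataset.

```python
import re

def normalize_text(text):
    # Lowercase and keep only letters, digits and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", "", text.lower())

def normalize_values(text):
    # Remove punctuation, then try to read what is left as an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

clean_answer = normalize_text("McDonald's")   # apostrophe is removed
clean_amount = normalize_values("$2,000")     # "$" and "," are removed
clean_missing = normalize_values("None")      # not a number, falls back to 0

print(clean_answer, clean_amount, clean_missing)
```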
In [9]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
In [10]:
jeopardy.dtypes
Out[10]:
Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

Looking for answers in the questions.

In [11]:
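def answ_in_quest(row):
    # Fraction of the answer's words that also appear in the question.
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    
    match_count = 0
    
    # "the" is too common to count as a meaningful match.
    # Note: list.remove() only drops the first occurrence.
    if "the" in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer.
    if len(split_answer) == 0:
        return 0
    
    for item in split_answer:
        if item in split_question:
            match_count += 1
            
    return match_count / len(split_answer)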

jeopardy['answer_in_question'] = jeopardy.apply(answ_in_quest, axis=1)
    
In [12]:
jeopardy["answer_in_question"].mean()
Out[12]:
0.060493257069335914

On average, about 6% of the words in an answer also appear in its question. That is too small a share to win the game on, but it is still worth keeping in mind.
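
To show what that per-row score means, here is a minimal sketch that computes it for two made-up question/answer pairs. The function `answer_overlap` and both sample rows are hypothetical; they follow the same logic as the notebook cell but take plain strings instead of a DataFrame row.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring one 'the') found in the question."""
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Hypothetical rows, already normalized.
half_match = answer_overlap(
    "this john was the second president of the united states", "john adams"
)
no_match = answer_overlap("remember this texas mission", "the alamo")

print(half_match, no_match)
```

In the first row one of the two answer words ("john") occurs in the question, so the score is 0.5; in the second row "the" is dropped and "alamo" is not in the question, so the score is 0.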

Looking for questions that tend to repeat (trying to win the easy way :)

In [13]:
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()
Out[13]:
0.6925960057338565

Here the mean is about 0.69: on average, roughly 70% of the long words (six or more characters) in a question had already appeared in an earlier question. That suggests studying past questions is quite advantageous. We have to be careful here, though: we only compared individual words to decide whether a question had been asked before, so a variation of a question that calls for a different answer would also count as a match. Either way, working through previously asked questions is an important advantage.
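
The overlap score is easier to follow on a tiny example. The sketch below repeats the same word-by-word logic on four made-up, already normalized questions; the function name `question_overlap_scores` and the sample questions are assumptions for illustration only.

```python
def question_overlap_scores(clean_questions):
    """For each question, the share of its long words seen in earlier text."""
    terms_used = set()
    scores = []
    for question in clean_questions:
        # Keep only words longer than 5 characters, as in the notebook.
        words = [w for w in question.split(" ") if len(w) > 5]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        scores.append(match_count / len(words) if words else 0)
    return scores

# Hypothetical questions, in air-date order.
scores = question_overlap_scores([
    "this italian astronomer improved the telescope",
    "this telescope was launched into orbit in 1990",
    "an italian astronomer named this planet",
    "he won in 1912",
])
print(scores)
```

The second question scores 0.5 because "telescope" was seen before and "launched" was not, even though the two questions are about different things. That is the caveat from the paragraph above: shared vocabulary is not the same as a repeated question.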

High-value vs. low-value questions.

In [14]:
def assign_values(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value
jeopardy['high_value'] = jeopardy.apply(assign_values, axis = 1)
In [15]:
def takes_a_word(word):
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
In [16]:
observed_expected = []

comparison_terms = list(terms_used)[:5]
In [17]:
print(comparison_terms)
['targetblankpierced', 'fingers', 'potawatomi', 'mid1984', 'merged']
In [18]:
for term in comparison_terms:
    observed_expected.append(takes_a_word(term))
In [19]:
observed_expected
Out[19]:
[(1, 0), (5, 5), (1, 1), (0, 1), (5, 3)]
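
Calling `takes_a_word` once per term walks the whole dataset each time. As a rough sketch of an alternative, the counts for several terms can be collected in a single pass. The helper `count_terms` and the sample rows below are hypothetical and use plain `(clean_question, high_value)` pairs rather than the DataFrame.

```python
def count_terms(rows, terms):
    """Return {term: (high_count, low_count)} after one pass over the rows."""
    counts = {term: [0, 0] for term in terms}  # [high_count, low_count]
    for clean_question, high_value in rows:
        words = set(clean_question.split(" "))
        for term in counts:
            if term in words:
                index = 0 if high_value == 1 else 1
                counts[term][index] += 1
    return {term: tuple(pair) for term, pair in counts.items()}

# Hypothetical rows: (clean_question, high_value).
rows = [
    ("two companies merged in 1998", 1),
    ("cross your fingers", 0),
    ("these firms merged", 0),
    ("fingers and toes", 1),
    ("the banks merged", 1),
]
counts = count_terms(rows, ["merged", "fingers"])
print(counts)
```

With the real data, the rows could be supplied as `zip(jeopardy["clean_question"], jeopardy["high_value"])`.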

Applying the chi-squared test

In [20]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
print(high_value_count)
5734
In [21]:
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]
In [22]:
chi_squared = []

from scipy.stats import chisquare
import numpy as np
In [23]:
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / len(jeopardy)
    exp_high_value = total_prop * high_value_count
    exp_low_value = total_prop * low_value_count
    
    chi_squared.append(chisquare(np.array([obs[0], obs[1]]), np.array([exp_high_value, exp_low_value])))
In [24]:
chi_squared
Out[24]:
[Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.2243874083063973, pvalue=0.13584652879916373),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=4.4765585681292279, pvalue=0.034362848042873227)]
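
To see where these numbers come from, here is a sketch that recomputes the first result by hand using only the standard library. One value is an assumption: the post does not print `low_value_count`, so 14265 is used because it reproduces the printed statistics. With two categories there is one degree of freedom, and for that case the p-value can be written as `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, high_total, low_total):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    n_rows = high_total + low_total
    # Share of all questions that contain the term.
    share = (observed_high + observed_low) / n_rows
    expected = (share * high_total, share * low_total)
    observed = (observed_high, observed_low)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

HIGH_TOTAL = 5734    # printed in the notebook
LOW_TOTAL = 14265    # assumption: not printed in the post

statistic, p_value = chi_squared_two_groups(1, 0, HIGH_TOTAL, LOW_TOTAL)
print(statistic, p_value)
```

Only the last term, with observed counts (5, 3), has a p-value below 0.05. With observed counts this small the chi-squared approximation is rough, so these p-values are best read as indicative only.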

In part 2 we will try different models to draw more conclusions, see whether they work better than the chi-squared test, and work in more depth with prediction models.


Published

Jul 4, 2017

Category

Statistics

Tags

  • Linear Regression 2
  • Python 10
  • Statistics 4
